Architecture Studies and System Demonstrations of
Optical Parallel Processors for AI and NI


During  the  contract  period  we  have  studied  architecture,  algorithm,  and  system  issues  per¬ 
taining  to  the  implementation  of  optoelectronic  technology  for  Artificial  Intelligence  (AI)  and 
Neural  Intelligence  (NI).  As  a  result,  we  have  developed  the  Programmable  Opto-Electronic 
Multiprocessor  (POEM)  system.  We  have  demonstrated  the  superiority  of  the  POEM  architec¬ 
ture  over  VLSI  and  other  optical  systems  in  many  applications.  We  have  developed  or  modified 
parallel  AI  algorithms  for  efficient  implementation  on  POEM.  Finally,  we  are  currently  assem¬ 
bling  a  prototype  POEM  system  and  subsystems  necessary  for  future  POEM  systems. 

I.  ARCHITECTURE  STUDIES 

Effective AI and NI applications require fine grain, massive parallelism with dense,
reconfigurable interconnects. Free-space holographic optical interconnects between modulators
and detectors offer faster, higher density connections with less energy consumption than
electronics for connections longer than a certain break-even length [1]. Furthermore, free-space
optical interconnects are immune to crosstalk. In addition to improving packaging density,
optoelectronics relieves pin constraints imposed by VLSI technology, allowing parallel data
loading. Efficient fault-tolerance techniques using reconfigurable optical interconnects can be
implemented, since the functioning processing elements are not permanently coupled to a fixed
interconnection. We have developed the Programmable Opto-Electronic Multiprocessor (POEM)
architecture, shown in Figure 1, to exploit these advantages offered by optoelectronic
technology.

The  POEM  architecture 

POEM is a highly parallel architecture, based on wafer-scale integration (WSI) of
optoelectronic processing elements (PEs), reconfigurable free-space optical interconnects, and
3-D optical memory. POEM can support any variation in the parameters commonly used to classify
parallel systems, such as granularity, topology, and synchrony. Unlike conventional parallel
systems, POEM is not limited to a fixed interconnection topology among the processors. Instead,
reconfigurable optical interconnects provide the topologies that best fit the present algorithm.
Fine grain POEM systems are especially effective for the rapid execution of symbolic
information processing tasks and graph algorithms because of the programmability of the optical
interconnections and the large number of simple PEs. In particular, it is expected to offer
extremely high performance for semantic networks, production systems, knowledge and relational
data bases, and optimization problems.

Comparison  to  Symbolic  Substitution 

We have compared POEM to Symbolic Substitution (SS) in terms of computational
efficiency, speed, size, energy utilization, programmability and fault tolerance [2]. We have
found that POEM offers advantages over SS in each of these areas. For example,
space-invariant SS systems are equivalent to a 2-D mesh-connected architecture, which is
computationally inefficient in many applications. Though the use of multiple and complex
substitution rules mitigates this limitation, such rules exact a heavy penalty in system power
dissipation, size, and speed. We have categorized the complexity of SS rules and determined the
relations between complexity and these parameters. The programmability of a digital computer is
closely associated with its ability to implement random access memory (RAM). Space-invariant
SS does not provide an efficient means of implementing RAM, limiting its applications and
increasing programming complexity. Lastly, routing around faults in SS leads to additional
constraints in the layout of a computation. POEM alleviates these problems by using
energy-efficient local electronic and global optical interconnects. Each PE has local RAM to
provide programming ease, while reconfigurable space-variant optical interconnects allow for
efficient fault-tolerance.

Comparison  to  VLSI 

We have also compared the POEM architecture with conventional VLSI-based computers.
We have found that VLSI, although successful in medium grain MIMD computers, has had
relatively little success in the area of general purpose, massively parallel SIMD computers. For
example, the Connection Machine 2 has 64K processing elements and is a general purpose fine
grain parallel computer, but its cost and size have limited the scope of application. In fine grain
parallel computers, a large number of processing elements are required to solve interesting
problems. Using fixed interconnection networks and routing implies that the message latency
increases with increasing number of PEs. Thus, in fine grain parallel VLSI computers, the
communication overhead can slow down the computation. Some progress has been made in
VLSI-based mesh-connected computers that do not use routing; however, these machines perform
well only on a limited class of specific problems, while the packaging technology used to create
them makes the overall system high in cost and size. The Hughes 3-D WSI architecture has many
features in common with POEM. By stacking wafers and interconnecting them in a 3-D mesh,
with the third dimension serving as a bus, the Hughes architecture achieves many of the
same performance characteristics expected in POEM. However, the constraints of mesh topology,
power dissipation, and yield management limit the performance of 3-D WSI below that expected
from POEM.
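
The claim that latency grows with the number of PEs on a fixed network can be illustrated
with a small calculation. The sketch below assumes a square 2-D mesh with nearest-neighbor
hops and no wraparound links; the function name and the chosen sizes are illustrative only.
The 65536-PE case is used purely as a convenient size and is not a model of any particular
machine's network.

```python
import math


def mesh_diameter(num_pes: int) -> int:
    """Worst-case hop count between two PEs on a square 2-D mesh (no wraparound)."""
    side = math.isqrt(num_pes)
    if side * side != num_pes:
        raise ValueError("num_pes must be a perfect square")
    # Corner-to-corner path: (side - 1) hops in each of the two dimensions.
    return 2 * (side - 1)


# Assumed sizes for illustration: the worst-case distance grows as sqrt(num_pes).
for n in (16, 1024, 65536):
    print(n, "PEs ->", mesh_diameter(n), "hops worst case")
```

Under these assumptions the worst-case distance grows with the square root of the PE count,
whereas a connection pattern reconfigured to match the algorithm would link communicating PEs
directly.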

II.  ALGORITHM  STUDIES 

In order to determine the architectural requirements of an optoelectronic AI system, we
have examined existing parallel AI models as well as developed new parallel algorithms. Our
studies have concentrated on connectionist AI systems because their basic structure of many
very simple processing elements can be directly mapped onto the POEM architecture. We have
found that parallel matrix-algebraic formulations offer vast improvements in operational speed
over sequential methods for many AI problems. For large, sparsely connected graphs, the
space-bandwidth requirements of a matrix-algebraic formulation may be prohibitive. In this case,
graph edges may be encoded as reprogrammable optical interconnections between processing
elements.

Consistent Labeling:

Many tasks in AI can be seen as constraint satisfaction problems. Finding correct solutions
to such problems involves searching through a large number of possibilities. The method of
Consistent Labeling can extensively prune the search tree of a constraint satisfaction problem to
allow for efficient determination of solutions. We have developed a highly parallel Consistent
Labeling algorithm which maps well onto optoelectronic architectures [3].

Our  algorithm  uses  a  matrix  encoding  which  is  based  on  the  algorithm  conventionally  asso¬ 
ciated  with  Consistent  Labeling  but  larger  in  size.  By  using  more  space,  we  have  significantly 
reduced  the  operating  time.  Our  algorithm  uses  Boolean  vector  products,  which  require  global 
communications  but  only  limited  dynamic  range,  making  them  well  suited  for  optoelectronic 
implementation.  The  algorithm  is  capable  of  handling  ternary  and  quaternary  constraints,  as  well 
as  obtaining  k-consistency  for  any  value  of  k,  where  k  represents  the  depth  to  which  data  is 
analyzed  for  inconsistencies. 
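
The role of Boolean vector products in pruning can be sketched as follows. This is not the
matrix encoding of [3]; it is a minimal illustration restricted to binary constraints
(2-consistency), and the names `bool_dot`, `prune`, `compat`, and the three-unit example are
all assumptions made here. A label survives for a unit only if, for every constrained
neighbor, the Boolean product of its compatibility row with the neighbor's label vector is
true.

```python
def bool_dot(row, vec):
    """Boolean vector product: OR over element-wise AND (needs only 1-bit dynamic range)."""
    return any(r and v for r, v in zip(row, vec))


def prune(domains, compat):
    """Repeatedly delete unsupported labels until nothing changes.

    domains: list of Boolean label vectors, one per unit.
    compat:  dict mapping (u, v) to a Boolean matrix M where M[a][b] is True
             when label a on unit u is compatible with label b on unit v.
    """
    while True:
        # All units are updated from the same old state, as a parallel array would do.
        new_domains = [
            [
                alive and all(
                    bool_dot(matrix[a], domains[v])
                    for (uu, v), matrix in compat.items()
                    if uu == u
                )
                for a, alive in enumerate(dom)
            ]
            for u, dom in enumerate(domains)
        ]
        if new_domains == domains:
            return domains
        domains = new_domains


# Illustrative problem: a chain of three units, neighbors must take different labels,
# and unit 0 is already restricted to label 0.
different = [[False, True], [True, False]]
compat = {(0, 1): different, (1, 0): different, (1, 2): different, (2, 1): different}
start = [[True, False], [True, True], [True, True]]
result = prune(start, compat)
print(result)
```

Each pass trades extra comparisons for fewer sequential steps, which is the same
space-for-time trade described above, although the real algorithm's encoding differs.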

Expert Systems:

The data base search for matching in deterministic AI problems, such as those encountered
in expert systems, often causes serious bottlenecks when performed on sequential electronic
computers. These bottlenecks arise due to the exponential nature of the serial search process,
which can be alleviated by parallel search. We have developed a matrix-algebraic formulation of
the search process to achieve this parallelism [4].

In our formulation, the knowledge base of the expert system is stored in a set of Boolean
matrices. Each matrix represents a particular attribute between ordered pairs. Inference and
learning are achieved using simple matrix-matrix multiplication, matrix intersection and matrix
composition operations. Global searches are performed in parallel by stating the desired attribute
relations in a matrix equation.
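
A small sketch may help make the matrix view concrete. The entities, attribute names, and
helper functions below are hypothetical and are not taken from [4]; the sketch only shows how
Boolean composition derives a new relation and how intersection answers a combined query over
all ordered pairs at once.

```python
def relation(n, pairs):
    """Build an n x n Boolean matrix with True at each listed ordered pair."""
    m = [[False] * n for _ in range(n)]
    for i, j in pairs:
        m[i][j] = True
    return m


def compose(a, b):
    """Boolean matrix product: (i, j) holds if some k has a[i][k] and b[k][j]."""
    n = len(a)
    return [
        [any(a[i][k] and b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def intersect(a, b):
    """Element-wise AND: pairs that satisfy both attributes."""
    return [[x and y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def pairs_of(m):
    """List the ordered pairs stored in a Boolean matrix."""
    return sorted((i, j) for i, row in enumerate(m) for j, v in enumerate(row) if v)


# Hypothetical knowledge base over four entities numbered 0..3.
parent_of = relation(4, [(0, 1), (1, 2), (1, 3)])
lives_with = relation(4, [(0, 2), (2, 3)])

# Inference: grandparent_of is the composition of parent_of with itself.
grandparent_of = compose(parent_of, parent_of)
# Query stated as a matrix equation: grandparent_of AND lives_with.
answer = intersect(grandparent_of, lives_with)
print(pairs_of(grandparent_of), pairs_of(answer))
```

Every entry of the result is computed independently, so the whole search can in principle
proceed in parallel across the matrix.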

NETL:

We have examined the feasibility of implementing the NETL knowledge-base system on
the POEM architecture [NETL]. NETL is capable of performing search operations on the
knowledge-base in near constant time, independent of the size of the knowledge-base. In NETL,
knowledge is stored as a pattern of interconnections between many simple processing elements,
which allows quick parallel searches to be performed on the knowledge-base.

We have found that the fine-grain POEM architecture is well suited to NETL because of the
large number of processing elements and programmable optical interconnects. Knowledge is
encoded in the optical interconnections, which can change as new knowledge is added. An
arbitrary number of interconnections from a given node can be achieved if extra processing
elements, called fan-out units, are used. We have shown that the availability of arbitrary,
reprogrammable optical interconnections offers a distinct advantage over electronic
implementations of NETL.
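
The idea of storing knowledge as interconnections and searching it in parallel can be sketched
with simple marker propagation. The node names, the `links` dictionary, and the `propagate`
function are illustrative assumptions, and fan-out units are not modeled. In each step every
marked node passes its marker along all of its outgoing links at once, so the number of steps
follows the depth of the link chain rather than the number of nodes.

```python
def propagate(links, start):
    """Spread a marker along links; return the marked nodes and the step count.

    links: dict mapping a node to the nodes it is connected to.
    start: iterable of initially marked nodes.
    """
    marked = set(start)
    frontier = set(start)
    steps = 0
    while frontier:
        # One parallel step: all frontier nodes send the marker simultaneously.
        reached = {dst for src in frontier for dst in links.get(src, ())} - marked
        if not reached:
            break
        marked |= reached
        frontier = reached
        steps += 1
    return marked, steps


# Hypothetical "is-a" hierarchy stored purely as connections between nodes.
is_a = {
    "Clyde": ["elephant"],
    "elephant": ["mammal"],
    "mammal": ["animal"],
    "Tweety": ["bird"],
    "bird": ["animal"],
}

marked, steps = propagate(is_a, ["Clyde"])
print("Clyde is an animal:", "animal" in marked, "after", steps, "steps")
```

Adding more unrelated nodes to `is_a` leaves the step count for this query unchanged, which is
the behavior that makes near constant time search plausible on a fine grain machine.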

III.  OPTICAL  STORAGE  STUDIES 

The  POEM  architecture  requires  3-D  optical  storage  and  reconfigurable  interconnections 
between  2-D  processing  arrays.  Towards  this  end,  we  have  experimentally  characterized  the 
photorefractive  behavior  and  measured  the  holographic  storage  capacity  of  Cerium-doped 
SBN:60. 

SBN:60 Characterization:

Volume holographic recording in photorefractive crystals is important for information
storage and interconnection applications. Strontium Barium Niobate (SBN) provides excellent
long-term storage and wave-mixing characteristics, with high index modulation and moderate
sensitivity. SBN's recording sensitivity and diffraction efficiency increase substantially when an
external field is applied along the grating wave vector. The applied field affects the writing and
erasing processes differently. The resulting asymmetry can increase the capacity of SBN, since
previously stored holograms become more resistant to erasure by additional superimposed
holograms.

In order to make accurate predictions of the number of holograms which can be stored, we
have experimentally measured the behavior of Cerium-doped SBN:60 (Sr0.6Ba0.4Nb2O6)
under applied field, and compared these results to theory [5]. We measured the field dependence
of the recording and erasing response time and sensitivity, gain coefficient, and steady-state
diffraction efficiency. The data followed the predictions of Kukhtarev's band transport model.
Using this model and our experimental data, we were able to predict the holographic storage
capacity of SBN. Figure 2 shows the number of holograms, N, as a function of the applied field
Ea for a 1 mm thick SBN:60 crystal with diffraction efficiency η = 10%, 5%, and 1%. For η = 1%,
N increases from O(1) at Ea = 0 to O(100) at 10 to 20 kV/cm. Preliminary experiments in a 1 mm
thick crystal support these predictions, although the maximum number of stored holograms was
about 30% lower than expected. More conclusive investigation with holograms of images rather
than plane waves is currently in progress.

IV.  SYSTEM  STUDIES 

In order to exploit the advantage of photorefractive holographic storage in a system, we
developed the Correlation Matrix-Tensor Multiplier (CMTM). Experiments are in progress to
verify the CMTM system, which uses the holographic storage capacity of SBN:60. In addition, we
have begun implementation of a prototype POEM system with two 2x2 arrays of processing
elements.

Correlation Matrix-Tensor Multiplier:

For efficient general purpose computing, the POEM system requires reconfigurable
crossbar interconnection of the processing elements. For optimum use of electronic area, input
and output apertures, and interconnection system volume, 2-D input and output arrays are used.
The interconnection between 2-D input and 2-D output arrays requires a fourth-rank tensor
operation. Towards this goal, we have developed the correlation matrix-tensor multiplier
(CMTM) algorithm [6].

Figure 3 illustrates the basic concepts of the CMTM system. The matrix-tensor
multiplication can be achieved using a correlation of an N x N input array with an N x N array
of N x N subarrays. By phase-coding the input and tensor arrays, the background noise can be
suppressed, since the phase codes cancel only at the output sites. The phase-coded tensor can be
compressed to reduce its size and bandwidth, and to match the output array scale to that of the
input. This reduction in size and bandwidth is made at the expense of signal-to-noise ratio at the
output. Our theoretical calculations and computer simulations show that the average SNR scales
as (K/N)^2 F, where K is the 1-D phase-code density and F is the average fan-in of the
interconnection pattern. This means that for patterns where the fan-in is proportional to the
number of inputs (as seems to be true for neural-network systems), the SNR is determined by the
phase-code density only, independent of the array size.
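
The correlation step can be checked numerically with a small sketch. The layout used here,
where subarray (l, m) of the tensor image holds the weights feeding output (l, m), is an
assumption consistent with the description of an N x N array of N x N subarrays; phase coding
and compression are not modeled, and all function names are illustrative. Sampling the
correlation plane at every N-th position recovers the fourth-rank tensor product, while the
remaining positions hold the unwanted cross terms.

```python
def tensor_image(w, n):
    """Lay out w[l][m][i][j] as an n x n array of n x n subarrays (assumed layout)."""
    size = n * n
    img = [[0] * size for _ in range(size)]
    for l in range(n):
        for m in range(n):
            for i in range(n):
                for j in range(n):
                    img[l * n + i][m * n + j] = w[l][m][i][j]
    return img


def correlate(g, img, n):
    """Cross-correlation of the n x n input g with the tensor image (valid region only)."""
    out = len(img) - n + 1
    return [
        [
            sum(g[i][j] * img[u + i][v + j] for i in range(n) for j in range(n))
            for v in range(out)
        ]
        for u in range(out)
    ]


def direct_product(g, w, n):
    """Reference result: out[l][m] = sum over i, j of w[l][m][i][j] * g[i][j]."""
    return [
        [
            sum(w[l][m][i][j] * g[i][j] for i in range(n) for j in range(n))
            for m in range(n)
        ]
        for l in range(n)
    ]


n = 2
# Arbitrary illustrative weights and input.
w = [[[[(8 * l + 4 * m + 2 * i + j) % 3 for j in range(n)] for i in range(n)]
      for m in range(n)] for l in range(n)]
g = [[1, 0], [1, 1]]

plane = correlate(g, tensor_image(w, n), n)
# Output sites sit at multiples of n in the correlation plane.
sampled = [[plane[l * n][m * n] for m in range(n)] for l in range(n)]
print(sampled == direct_product(g, w, n))
```

The values of `plane` away from the sampled sites correspond to the background that the phase
coding described above is meant to suppress.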

We have been investigating experimental implementations of the CMTM system using both
photorefractive crystals and computer generated holograms (CGH). The CGH experiment will
allow us to verify the CMTM concept. We have used the CMTM simulation to generate a set of
test patterns with predicted outputs, which were compared to experimental results. By using
photorefractive volume holography, the interconnection can be made either by continuously
mixing the input with a connection image, or by diffraction from one of several tensors, each
previously stored as a single complex hologram.

Prototype  POEM  Implementation: 

We have begun implementation of a prototype POEM system consisting of two planes of
2 x 2 arrays in a folded architecture. Each processing plane exists on a single CMOS chip. Each
processing element has optical data detectors, while the chip controller has control and clock
detectors. Each chip is wire bonded to a PLZT substrate with one optical modulator for each PE.
Electronic output is available to monitor each PE during operation. E-beam fabricated CGHs
provide a butterfly interconnection between the planes. Additional CGHs are used to couple
light through the modulators and to focus signal beams onto the detectors. Laser diodes
interfaced with a host computer transmit clock and control signals to each chip. Data input is
achieved with an auxiliary PLZT SLM, also interfaced with the host computer, while data output
is monitored with a silicon detector array. A micrograph of the POEM chip and the schematic
optical system are shown in Figures 4 and 5 respectively.
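
A butterfly interconnection between two planes can be sketched as an index mapping. The
numbering of the four PEs as 0 to 3, the notion of a selectable `stage`, and the function
names are assumptions made for illustration; the text does not state which butterfly stage the
prototype's CGHs implement. In stage k, PE i on one plane receives from PEs i and i XOR 2^k on
the opposite plane.

```python
def butterfly_sources(num_pes, stage):
    """Map each destination PE to the two source PEs feeding it in one butterfly stage."""
    if num_pes < 2 or num_pes & (num_pes - 1):
        raise ValueError("num_pes must be a power of two, at least 2")
    stride = 1 << stage
    if stride >= num_pes:
        raise ValueError("stage out of range for this array size")
    return {dst: (dst, dst ^ stride) for dst in range(num_pes)}


def reachable(num_pes, src):
    """PEs that can hold data from src after passing through every stage once."""
    holders = {src}
    for stage in range(num_pes.bit_length() - 1):
        sources = butterfly_sources(num_pes, stage)
        holders = {dst for dst, (a, b) in sources.items() if a in holders or b in holders}
    return holders


# A 2 x 2 plane flattened to PEs 0..3 (assumed numbering).
stage0 = butterfly_sources(4, 0)
stage1 = butterfly_sources(4, 1)
print(stage0)
print(stage1)
print(sorted(reachable(4, 0)))
```

With four PEs, two passes between the planes (one per stage) are enough for data from any PE
to reach every other PE under this assumed mapping.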

In this implementation we use multiple-SIMD synchrony, where all PEs on the same plane
operate in synchrony, but different instruction streams are sent to the two processing array
modules. The processing element is a simple 1-bit processor with 64 bits of random access
memory and three registers: a general purpose register, a sleep register for conditional execution
and a carry register. There are thirteen instructions to perform logic, data movement, conditional
execution, and I/O operations. The processing element has three optical data detectors. Two of
these detectors receive data from the opposite array, while the third detector is used for loading
data.
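
The behavior of such a processing element can be sketched in software. The register set and
the 64-bit RAM follow the description above, but the instruction names used here (`LOAD`,
`STORE`, `ADDC`, `CLRC`, `LDC`, `SLEEPZ`, `WAKE`) are hypothetical, since the thirteen actual
instructions are not listed. The sketch runs one instruction stream over all PEs of a plane
and uses the sleep register to skip PEs whose flag bit is zero during a bit-serial addition.

```python
class PE:
    """1-bit processing element: 64-bit RAM plus general, sleep, and carry registers."""

    def __init__(self):
        self.ram = [0] * 64
        self.g = 0
        self.sleep = 0
        self.carry = 0


def step(plane, op, addr=0):
    """Broadcast one (hypothetical) instruction to every PE on a plane."""
    for pe in plane:
        if op == "WAKE":  # assumed to be obeyed even by sleeping PEs
            pe.sleep = 0
            continue
        if pe.sleep:  # sleeping PEs ignore all other instructions
            continue
        if op == "LOAD":
            pe.g = pe.ram[addr]
        elif op == "STORE":
            pe.ram[addr] = pe.g
        elif op == "CLRC":
            pe.carry = 0
        elif op == "LDC":
            pe.g = pe.carry
        elif op == "ADDC":  # 1-bit full adder: g + ram[addr] + carry
            a, b, c = pe.g, pe.ram[addr], pe.carry
            pe.g = a ^ b ^ c
            pe.carry = (a & b) | (a & c) | (b & c)
        elif op == "SLEEPZ":  # conditional execution: sleep when g is 0
            if pe.g == 0:
                pe.sleep = 1
        else:
            raise ValueError(f"unknown instruction {op}")


def write_word(pe, base, value, width):
    for k in range(width):
        pe.ram[base + k] = (value >> k) & 1


def read_word(pe, base, width):
    return sum(pe.ram[base + k] << k for k in range(width))


WIDTH, A, B, S, FLAG = 4, 0, 8, 16, 32  # assumed RAM layout
plane = [PE() for _ in range(4)]
for pe, (a, b, flag) in zip(plane, [(3, 5, 1), (7, 9, 1), (15, 1, 0), (2, 2, 1)]):
    write_word(pe, A, a, WIDTH)
    write_word(pe, B, b, WIDTH)
    pe.ram[FLAG] = flag

program = [("WAKE", 0), ("LOAD", FLAG), ("SLEEPZ", 0), ("CLRC", 0)]
for k in range(WIDTH):
    program += [("LOAD", A + k), ("ADDC", B + k), ("STORE", S + k)]
program += [("LDC", 0), ("STORE", S + WIDTH), ("WAKE", 0)]

for op, addr in program:
    step(plane, op, addr)

sums = [read_word(pe, S, WIDTH + 1) for pe in plane]
print(sums)
```

The third PE has its flag cleared, so it sleeps through the addition and its sum field stays
at zero, which is the kind of conditional execution a sleep register is meant to provide.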


V.  FUTURE  RESEARCH 


Several design and implementation issues need to be solved before the realization of
parallel and distributed computing systems. Our previous implementation efforts have shown that
approaches and architectures based on conventional electronic technologies will not result in
successful implementations when applied to optoelectronic technologies. Hence, there is a need to
develop technological trends and system concepts unique to optoelectronic implementation. Also,
for manufacturability of optically interconnected parallel computing systems, we need to
resolve issues related to packaging, design automation, and testing, and find new architectures
which fully exploit the advantages of optical interconnections. Hence, as a continuation of our
current effort, we will direct our future studies into areas such as: (a) integration of many
electronic PEs with optical interconnections into a compact, rugged opto-electronic package of
minimum volume or size (i.e., opto-electronic packaging), (b) computer aided design and layout
of opto-electronic PEs and optical components for different interconnection networks (i.e.,
development of an opto-electronic CAD), (c) new architectures and schemes which address
the reliability (fault tolerance) of the distributed multiprocessor computing system with many
opto-electronic PEs and a complex interconnection network, and (d) application of the results of
previous studies to combine the power of optoelectronic architectures for creating artificial
intelligence with neural networks. We will emphasize the development of synapse arrays by
combining electronic processing in the synapse and optical interconnection for the communication
between synapses.

Figure Captions


Figure  1:  The  POEM  architecture  consists  of  optoelectronic  process¬ 
ing  elements  linked  by  free-space,  reprogrammable  interconnections.  Pro¬ 
cessing  elements  consist  of  optical  detectors,  modulators,  and  electronic 
logic  circuitry.  Control  and  clock  signals  are  distributed  by  a  central  CGH. 
3-D  optical  memory  provides  parallel  data  input  to  the  system. 

Figure 2: The number of storable holograms calculated as a function
of Ea. The values shown are for a 1 mm thick crystal with final diffraction
efficiency of 1%, 5%, and 10% for each of the superimposed holograms. The
applied field increases the crystal's performance by increasing the index
modulation and the write sensitivity.

Figure 3: (a) Correlation of the input array g(x, y) with the tensor
image W*(x, y) produces the connection output Cim. The outputs are
imbedded in a field of noise. (b) By phase-coding the input and tensor
images, the background noise can be suppressed. (c) The phase-coded
tensor image can be compressed to reduce its size and bandwidth. The
compression reduces the output SNR by an amount depending on the
phase-code density.

Figure  4:  Micrograph  of  the  prototype  POEM  processing  element 
chip.  Each  chip  has  four  1-bit  processors  with  64  bit  RAM,  three  regis¬ 
ters,  and  logic  and  I/O  circuitry.  These  chips  are  wire-bonded  to  a  PLZT 
substrate  having  modulator  electrodes. 

Figure  5:  Optical  system  layout  for  prototype  POEM  system.  CGHs 
provide  a  butterfly  interconnection  between  two  planes  in  a  folded  archi¬ 
tecture.  Control  and  clock  signals  are  provided  by  IR  laser  diodes  inter¬ 
faced  with  a  host  computer. 

Figure 5


Photograph  showing  the  layout  of  the  multiprocessor  chip 


